15.3 Binary search algorithm and data structure


Have you ever needed to quickly search for something in a sorted sequence of things in the real world? What about searching for a hotel room in a big hotel with long hallways that go in a circle around the elevator? Or better yet, imagine you are driving at night in the rain in a big subdivision, and you need to get out of the car to shine a flashlight on the curb to check the street numbers painted on the curb, looking for a particular address. You are in a hurry, and you don’t want to get wet, so you need to minimize the number of times you have to pull over and get out of your car. If you’re ever in this situation, the binary search algorithm will come in handy. You may have even already figured out this algorithm for yourself, and after this lesson you will know its name and how to implement it in Python.

One of the most basic algorithms that you will need for any website or database is the index or search algorithm. A search algorithm gives you a way to find the position (index) of a value within a sequence such as a list or tuple. And you need an algorithm that can find that index in seconds, even if you have billions of objects to search through. This lesson will show you how binary search works behind the scenes to keep the Internet humming along, blazingly fast.

All databases and websites (and almost all Python applications), need a way to find a value hidden inside a long list of values. In this lesson you will see how the most popular search algorithm works. It’s called “binary search.” The binary search algorithm is one of those algorithms that you may be asked about if you interview for a Python programmer job. The binary search algorithm is the basis of many search engines and databases that you use every day. But first you should learn why searching a list is so important.

For this module you will learn all about two different approaches to finding the position (index) of an object in a list.

If your boss asks you to write a Python function to find the position of a value in a list, what would you do? Well, the first thing to do is ask your boss to give you some examples. Here are two examples that your boss has given you:

# find the position of the value 4 within a sequence of integers
>>> sorted_list = [1, 2, 3, 4, 5]
>>> find_position(sorted_list, 4)
3
>>> sorted_list[3]
4

That sorted_list variable looks like a range(1, 6) list. So you are proud of yourself when you show off your super-fast function that just subtracts one from the value to get the correct position in a Python list:

def find_position(sorted_list, value):
    return value - 1

This looks a little wrong to you, because you aren’t using that first argument, named sorted_list at all. But when you test it on several more examples it works just fine, so you show it to your boss anyway. When your boss sees it on GitLab comment: “THAT’S NOT WHAT I MEANT! The user’s integer sequence can have gaps!” So you calmly ask them to provide another example to explain what they want. Your boss gives you another test example that looks like this:

>>> find_position([2, 4, 6, 8], 4)
1
>>> find_position([2, 4, 6, 8], 5)
None
>>> find_position([16, 17, 18, 19, 20], 20)
4

Looking at these test examples you realize you need to examine all the values in the input list if you want to be able to find the correct positions.

So for your next attempt you iterate through the lists and check each value against the value requested by the user.

def find_position(sorted_list, value):
    i = 0
    for x in sorted_list:
        if x == value:
            return i
        i = i + 1
    return None

This is similar to the accumulator pattern, but you are accumulating the position in an array instead of a sum or product. Can you think of a way to simplify this code using the += operator? What about using the built-in enumerate function to help you keep track of i for the integer position in your for loop?

Here’s another way to write this same function:

def find_position(sorted_list, value):
    for i, x in enumerate(sorted_list):
        if x == value:
            return i

The enumerate function turns a sequence into a sequence of 2-tuples. The index for each position in a sequence (i) is returned as the first value in the 2-tuple, and the value (x) comes second. Also, notice that if you do not return something at the end of function, Python will automatically return the value None for you.

When you give this to your boss they say it works great, and then they deploy your code to a website. Your company’s website allows users to upload sequences of numbers and plot them in pretty pictures. But a few days later your boss comes back to you and says users are starting to complain that your function takes too long when they give it long sequences of sorted values. For example, one user works for NOAA where they are modeling climate change. They have a sequence of global temperatures for every minute since 1880.

NOAA global temperatures reported by four different sources all agree, it's getting hot very very fast

And here is what the minute-by-minute log of data from a temperature sensor might look like:

>>> dt = datetime(1880, 1, 1)
... minute = timedelta(seconds=60)
... templogs = []
... while dt < datetime(2024,5,10):
...     templogs.append(
...         f"{dt.year:04d}-{dt.month:02d}-{dt.day:02d}" 
...         f" {dt.hour:02d}:{dt.minute:02d}"
...         f" {(dt - start).total_seconds() * 60 / 250_000_000:02.2f} degC"
...         )
...     dt = dt + minute
>>> templogs[-2:]
['2024-04-26 03:01 25.01 degC', '2024-04-26 03:01 25.02 degC']
>>> timestamps = [x[:16] for x in templogs]
>>> len(timestamps)
75736800

>>> find_position(timestamps, '2024-04-26 03:01 25.01 degC')
75736799

This NOAA scientist has collected almost 76 million data points for global temperatures since 1880! Unfortunately when they want find the position of a recent temperature log entry, near the end of the sequence, it takes almost a minute, and their website freezes up. This means the NOAA website can’t display those fancy animations of global warning. So you need to modify your function so that it can work faster with a lot more data.

Fortunately your boss gives you some example strings in a sorted list of integers to imitate the data on the NASA website:

>>> from datetime import datetime
>>> date_difference = datetime(2024, 1, 1) - datetime(1880, 1, 1)
>>> minutes = list(range(int(date_difference.total_seconds() / 60)))

First, use the iPython (Jupyter Console) magic function %timeit to see just how fast (or slow) your function is:

>>> %timeit find_position(minutes, 75736700)

And here’s that same test again, using Python’s built-in index method. All lists and tuples have an index method that do the same thing as your find_position() function.

>>> %timeit minutes.index(minutes[-1])
501 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit minutes.index(minutes[75_000_000])
495 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit minutes.index(minutes[0])
79.3 ns ± 0.749 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each) 

So the Python core developers have figured out a way to speed things up by about 8x. The built-in Python list.index() took 500 ms (half a second) for the values towards the end of the list. Your loop took about 4 seconds, or 8 times as long. Most Python methods will be much faster than your pure Python implementations, even if you use the exact same algoirthms, because almost all of the core Python built-ins are implemented in compiled C libraries.

It looks like the .index() method uses a brute force approach, just like yours. The longer the list, the longer it takes to find a value near the end of that list. This time is called the “worst case performance” of an algorithm. This is the worst case (longest time) that the brute force approach will take on this particular list of data.

If the built-in index worst case performance (500 ms) is fast enough for NOAA, then you can just use the list.index() method, instead of your own. If you want to beat the built-in index performance with your own code, you will need to use a smarter, more efficient algorithm. Binary search

Here’s another way to call the built-in list.index() method. You can use the list type rather an instance of a list object. The list class also has a .index() method. Just like for your find_position() function, the list you want to search should be the first positional argument, and the value you are searching for is the second argument:

>>>  list.index(minutes, 1_000_000)
1000000

>>>  %timeit list.index(minutes, 1_000_000)
6.75 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This doesn’t speed things up though. The list class also uses a brute force algorithm to find the value you ask for. This is where the power of algorithms can save you. A smart algorithm can leapfrog a faster computer by running 10x or more faster.

Imagine you want to search an alphabetized list of 9 names. You are looking for the name “Hal.” It probably makes sense to start in the middle of the list and see if that name is greater or less than the name you are looking for. If it’s greater than you can search the lower half of the list, and if it’s smaller (comes before), you can search the upper half. Sometimes this is called a “divide and conquer” approach to algorithm design. The idea is that you try to make your problem half as hard with each iteration through your loop.

Here’s a diagram of how binary search would work on a sorted list of only 9 names when you are looking for the string “Hal” among these names:

Binary search

To start the binary search you split the dataset in half and go to the middle at index 4 to find the value “Ell.” You check to see that your target value “Hal” is greater than the name “Ell” at that middle location in the data. This is because “Hal” would come after “Ell” in an alphabetized list of names. So your next step is to check the midpoint on the upper half of the list of data at position (4 + 8) // 2 which is index location 6. That gives you the value “Gil” which is also less than your target value “Hal.” So you find the midpoint between 6 and 8 with (6 + 8) // 2 which is 7. The value at 7 is the name “Hal” that you are looking for, so you return the index number of 7.

Perhaps now you can imagine how you might implement binary search in a Python function. Here’s one pretty advanced way of doing it:

>>> def find_position(sorted_list, value):
...     """Find the index integer for a value contained in a list."""
...     while len(sorted_list):
...         mid_index = len(sorted_list) // 2
...         mid_value = sorted_list[mid_index]
...         if value == mid_value:
...             return mid_index
...         elif value > mid_value:
...             offset = find_position(sorted_list[mid_index:], value)
...             if offset is None:
...                 return None
...             else:
...                 return offset + mid_index
...         return find_position(sorted_list[:mid_index], value)

This is an advanced kind of function, that is designed to call itself repeatedly until it finishes doing whatever it was supposed to do, in this case find the target value. In computer science, this kind of function is called a “recursive function”.

Use the %timeit function again to see

>>> %timeit find_position(timestamps, '2023-12-31 23:50')
987 ms ± 3.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit find_position(timestamps, '1880-01-01 23:00')
989 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit timestamps.index('1880-01-01 23:00')
16.6 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

>>> %timeit timestamps.index('2023-12-31 23:50')
902 ms ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notice that the built-in index function is much faster than our fancy binary search algorithm, especially for values at the beginning of the list. This is one way to tell that the index() method is definitely using a sequential (brute force) scan of all the value in the list. It just happens across the values at the beginning of the list sooner than those at the end. The binary search algorithm finds the correct answer in almost exactly 1 second, every time. It doesn’t matter whether what you’re looking for is right at the beginning or all the way at the end, the binary search algorithm has approximately the same number of steps. And keep in mind that your Python algorithm is sandbagging (intentionally given a speed penalty) because the code here is in Python, but the index function is implemented in C which is almost 10x faster.

So multiply the index times by 10x to give 9000 ms (9 s) for that value near the end of the list. Nine seconds probably doesn’t seem like a big disadvantage compared to the binary search time of one second. But what if one of your customers has more than a billion data points rather than the 75 million that NASA has. Finding a value at the end of the billion data points would take around 15 seconds if you used the built-in index method. Your binary search algorithm would always take less than a second no matter how long the list gets.